4 BACKGROUND OF THE INVENTION



16 4.1 FIELD OF THE INVENTION 

17 The invention relates to a common component of: 

18 • Speech Recognition, more particularly to the fields of Keyword Spotting and decoding, 

19 • Segments Alignment for DNA and proteins, 

20 • Recognition of Objects in Images, 

21 4.2 DESCRIPTION OF THE RELATED ART 

This invention addresses the problem of keyword spotting (KWS) in unconstrained speech without explicit modeling of non-keyword segments (typically done by using filler HMM models or an ergodic HMM composed of context dependent or independent phone models without lexical constraints). Several methods (sometimes referred to as "sliding model methods") tackling this type of problem have already been proposed in the past. E.g., they use Dynamic Time Warping (DTW) or Viterbi matching allowing relaxation of the (begin and endpoint) constraints. These are known to require the use of an "appropriate" normalization of the matching scores since segments of different lengths have then to be compared. However, given this normalization and the relaxation of begin/endpoints, straightforward Dynamic Programming (DP) is no longer optimal (or, in other words, the DP optimality principle is no longer valid) and has to be adapted, involving more memory and CPU. Indeed, at any possible ending time $e$, the match score of the best warp and start time $b$ of the reference has to be computed (for all possible start times $b$ associated with unpruned paths). Finally, this adapted DP quickly becomes even more complex (or intractable) for more advanced scoring criteria (such as the confidence measures mentioned below).

Work in the field of confidence level, and in the framework of hybrid HMM/ANN systems, has shown that the use of accumulated local posterior probabilities (as obtained at the output of a multilayer perceptron) normalized by the length of the word segment (or, better, involving a double normalization over the number of phones and the number of acoustic frames in each phone) yields good confidence measures and good scores for the re-estimation of N-best hypotheses. However, so far the evaluation of such confidence measures involved the estimation and rescoring of N-best hypotheses.

KWS methods without filler models have in common the selection of a subsequence of the utterance to match the interesting keyword models. Let $X = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$ denote the sequence of acoustic vectors in which we want to detect a keyword, and let $M$ be the HMM model of a keyword, consisting of $L$ states $\mathcal{Q} = \{q_1, q_2, \ldots, q_\ell, \ldots, q_L\}$. Assuming that $M$ is matched to a subsequence $X_b^e = \{x_b, \ldots, x_e\}$ ($1 \le b \le e \le N$) of $X$, and that we have an implicit (not modeled) garbage/filler state $q_G$ preceding and following $M$, one can define (approximate) the log posterior of a model $M$ given a subsequence $X_b^e$ as the average posterior probability along the optimal path, i.e.:

$$
\begin{aligned}
-\log P(M \mid X_b^e) &\simeq \frac{1}{e-b+1} \min_{Q \in M} -\log P(Q \mid X_b^e) \\
&= \frac{1}{e-b+1} \min_{Q \in M} \Big\{ -\log P(q^b \mid q_G) \\
&\qquad - \sum_{n=b}^{e-1} \big[ \log P(q^n \mid x_n) + \log P(q^{n+1} \mid q^n) \big] \\
&\qquad - \log P(q^e \mid x_e) - \log P(q_G \mid q^e) \Big\}
\end{aligned}
\qquad (1)
$$

where $Q = \{q^b, q^{b+1}, \ldots, q^e\}$ represents one of the possible paths of length $(e - b + 1)$ in $M$, and $q^n$ the HMM state visited at time $n$ along $Q$, with $q^n \in \mathcal{Q}$. In this expression, $q_G$ represents the "garbage" (filler) state, which is simply used here as the non-emitting initial and final state of $M$. Transition probabilities $P(q^b \mid q_G)$ and $P(q_G \mid q^e)$ can be interpreted as the keyword entrance and exit penalties, but can be simply set to 1. Local posteriors $P(q_\ell \mid x_n)$ can be estimated using any of the known techniques: multi-Gaussians, code-books, or as output values of a multilayer perceptron (MLP) used in hybrid HMM/ANN systems. For a specific subsequence $X_b^e$, expression (1) can easily be estimated by dynamic programming since the subsequence and the associated normalizing factor $(e - b + 1)$ are given. However, in the case of keyword spotting, this expression should be estimated for all possible begin/endpoint pairs $\{b, e\}$ (as well as for all possible word models), and we define the matching score of $X$ on $M$ as:

$$S(M \mid X) = -\log P(M \mid X_{b^*}^{e^*}) \qquad (2)$$

where the optimal begin/endpoints $\{b^*, e^*\}$, and the associated optimal path $Q^*$, are the ones yielding the lowest average local posterior:

$$\langle Q^*, b^*, e^* \rangle = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{e-b+1} \log P(Q \mid X_b^e) \qquad (3)$$
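To illustrate what criterion (3) asks for, the exhaustive search it implies can be sketched as below. This is only a minimal sketch under stated assumptions, not the method of the invention: the keyword HMM is assumed to be strictly left-to-right, entered in its first state and left from its last state, all transition probabilities are set to 1, and the table `cost[n][l]` (a hypothetical name) holds $-\log P(q_l \mid x_n)$ with 0-based frame and state indices.

```python
import math

def viterbi_cost(cost, b, e):
    """Minimal accumulated -log posterior for aligning frames b..e (inclusive)
    to a left-to-right HMM. Assumption: transition probabilities are 1."""
    n_states = len(cost[0])
    prev = [math.inf] * n_states
    prev[0] = cost[b][0]                      # the path must start in state 0
    for n in range(b + 1, e + 1):
        cur = [math.inf] * n_states
        for s in range(n_states):
            # either stay in state s or advance from state s - 1
            best = prev[s] if s == 0 else min(prev[s], prev[s - 1])
            cur[s] = best + cost[n][s]
        prev = cur
    return prev[-1]                           # the path must end in the last state

def best_normalized_match(cost):
    """Exhaustive evaluation of criterion (3): returns (score, b, e)."""
    n_frames = len(cost)
    best = (math.inf, None, None)
    for b in range(n_frames):
        for e in range(b, n_frames):
            score = viterbi_cost(cost, b, e) / (e - b + 1)
            if score < best[0]:
                best = (score, b, e)
    return best
```

Every begin/endpoint pair triggers its own alignment here, which is the cost that the methods described later aim to avoid.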

Of course, in the case of several keywords, all possible models will have to be evaluated.

A double averaging involving the number of frames per phone and the number of phones usually yields slightly better performance when used to rescore N-best candidates:

$$\langle Q^*, b^*, e^* \rangle = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{J} \sum_{j=1}^{J} \left( \frac{1}{e_j - b_j + 1} \sum_{n=b_j}^{e_j} \log P(q_j^n \mid x_n) \right) \qquad (4)$$

where $J$ represents the number of phones in the hypothesized keyword model and $q_j^n$ the hypothesized phone $q_j$ for input frame $x_n$. However, given the time normalization and the relaxation of begin/endpoints, straightforward DP is no longer optimal and has to be adapted, usually involving more memory and CPU.
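The difference between the simple and the double normalization can be made concrete for one fixed alignment. The sketch below is illustrative only: `frame_costs[n]` is assumed to hold $-\log P(q^n \mid x_n)$ along an already chosen path, `phone_of_frame[n]` (a hypothetical input) gives the index of the phone matched to frame $n$, and each phone is assumed to occupy one contiguous run of frames.

```python
def simple_normalized_score(frame_costs):
    """Accumulated -log posterior divided by the number of frames."""
    return sum(frame_costs) / len(frame_costs)

def double_normalized_score(frame_costs, phone_of_frame):
    """Average over phones of the per-phone average -log posterior."""
    sums, counts = {}, {}
    for c, j in zip(frame_costs, phone_of_frame):
        sums[j] = sums.get(j, 0.0) + c
        counts[j] = counts.get(j, 0) + 1
    per_phone = [sums[j] / counts[j] for j in sums]   # inner normalization
    return sum(per_phone) / len(per_phone)            # outer normalization
```

With the double normalization a short, badly matched phone weighs as much as a long, well matched one, whereas the simple normalization lets long phones dominate.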

Filler-based KWS methods need a simpler decoding step. Although various solutions have been proposed towards the direct optimization of (2), most of the keyword spotting approaches today prefer to preserve the optimality and simplicity of Viterbi DP by modeling the complete input and explicitly or implicitly modeling non-keyword segments by using so-called filler or garbage models as additional reference models. In this case, we assume that non-keyword segments are modeled by extraneous garbage models/states $q_G$ (and grammatical constraints ruling the possible keyword/non-keyword sequences).

It is sufficient to consider only the case of detecting one keyword per utterance at a time. In this case, the keyword spotting problem amounts to matching the whole sequence $X$ of length $N$ onto an extended HMM model $\overline{M}$ consisting of the states $\{q_G, q_1, \ldots, q_L, q_G\}$, in which a path (of length $N$) is denoted $\overline{Q} = \{\underbrace{q_G, \ldots, q_G}_{b-1}, q^b, q^{b+1}, \ldots, q^e, \underbrace{q_G, \ldots, q_G}_{N-e}\}$ with $(b - 1)$ garbage states $q_G$ preceding $q^b$ and $(N - e)$ states $q_G$ following $q^e$, and respectively emitting the vector sequences $X_1^{b-1}$ and $X_{e+1}^N$ associated with the non-keyword segments.

Given some estimation of $P(q_G \mid x_n)$ (e.g., using probability density functions trained on non-keyword utterances), the optimal path $\overline{Q}^*$ (and, consequently, $b^*$ and $e^*$) is then given by:

$$
\begin{aligned}
\overline{Q}^* &= \operatorname*{argmin}_{\forall \overline{Q} \in \overline{M}} -\log P(\overline{Q} \mid X) \\
&= \operatorname*{argmin}_{\forall \overline{Q} \in \overline{M}} \Big\{ -\log P(Q \mid X_b^e) - \sum_{n=1}^{b-1} \log P(q_G \mid x_n) - \sum_{n=e+1}^{N} \log P(q_G \mid x_n) \Big\}
\end{aligned}
\qquad (6)
$$

which can be solved by straightforward DP (since all paths have the same length). The main problem of filler-based keyword spotting approaches is then to find ways to best estimate $P(q_G \mid x_n)$ in order to minimize the error introduced by the approximations. Sometimes this value was defined as the average of the N best local scores while, in other approaches, this value is generated from explicit filler HMMs. However, these approaches will usually not lead to the "optimal" solution given by (2).
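A single DP pass solving (6) can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: the garbage state has one constant cost `eps` $= -\log P(q_G \mid x_n)$ for every frame, the keyword HMM is strictly left-to-right with all transition probabilities equal to 1, and `cost[n][l]` (a hypothetical table) holds $-\log P(q_l \mid x_n)$ with 0-based indices. The entry frame of the keyword is propagated along each partial path so that $b$ and $e$ can be read off at the end.

```python
import math

def filler_viterbi(cost, eps):
    """Solve (6) with a constant garbage cost eps per non-keyword frame.
    Returns (total_cost, b, e) with 0-based, inclusive keyword boundaries."""
    n_frames, n_states = len(cost), len(cost[0])
    score = [math.inf] * n_states    # best cost of a path ending in each keyword state
    start = [None] * n_states        # frame at which that path entered the keyword
    best = (math.inf, None, None)
    for n in range(n_frames):
        new_score = [math.inf] * n_states
        new_start = [None] * n_states
        for s in range(n_states):
            cand, st = score[s], start[s]                  # stay in state s
            if s > 0 and score[s - 1] < cand:              # advance from s - 1
                cand, st = score[s - 1], start[s - 1]
            if s == 0 and n * eps < cand:                  # enter after n garbage frames
                cand, st = n * eps, n
            new_score[s], new_start[s] = cand + cost[n][s], st
        score, start = new_score, new_start
        # leave the keyword after frame n; the remaining frames are garbage
        total = score[-1] + (n_frames - 1 - n) * eps
        if total < best[0]:
            best = (total, start[-1], n)
    return best
```

Because every complete path covers all frames, no length normalization is needed inside the recursion, which is why plain Viterbi remains valid here.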



107 5 BRIEF SUMMARY OF THE INVENTION 

108 The invention belongs to the technical domain of decoding, classification, alignment and 

109 matching of data. 

The invention introduces a new method performing tasks in keyword spotting in utterances, detection of subsequences in chains of organic matter (DNA and proteins) and recognition of objects in images. The proposed methods search in an optimized way for the matching that maximizes, over all the possible matchings, certain confidence measures based on normalized posteriors. Three such confidence measures are used: two existed in previous work in Speech Recognition, and the third one is new.

Application fields for this invention are: man-machine interfaces (using speech recognition, e.g., control systems, banking, flight services, etc.), coordination systems (for industrial robots and automata) and development systems for pharmaceutical products.

6 BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

121 Not Applicable 

122 7 DETAILED DESCRIPTION OF THE INVENTION 

The present invention introduces a fast iterative method, referred to as Iterating Viterbi Decoding (IVD), with good/fast convergence properties, estimating the value of $P(q_G \mid x_n)$ such that straightforward DP (6) yields exactly the same segmentation (and recognition results) as (3). While the same result could be achieved through a modified DP in which all possible combinations (all possible begin/endpoints) would be taken into account, the method proposed below is much more efficient (in terms of both CPU and memory requirements).

Compared to previously devised "sliding model" methods, the first method proposed here is based on:

1. A matching score defined as the average observation probability (posterior) along the most likely state sequence. It is indeed believed that local posteriors are more appropriate to the task.

2. The iteration of a Viterbi decoding algorithm, which does not require scoring for all begin/endpoints or N-best rescoring, and which can be proved to (quickly) converge to the "optimal" (from the point of view of the chosen scoring functions) solution without requiring any specific filler models, using straightforward Viterbi alignments (similar to regular filler-based KWS, but for some versions at the cost of a few iterations).



The IVD method is based on a criterion similar to that of the filler-based approaches (6), but rather than looking for explicit (and empirical) estimates of $P(q_G \mid x_n)$ we aim at mathematically estimating its value (which will be different and adapted to each utterance) such that solving (6) is equivalent to solving (3). Thus, we perform an iterative estimation of $P(q_G \mid x_n)$, such that the segmentation resulting from (6) is the same as what would be obtained from (3). Defining $\varepsilon_t = -\log P(q_G \mid x_n)$ at iteration $t$, the proposed method can be summarized as follows:

1. Start the first iteration, $t = 0$, from an initial value $\varepsilon_0 = \Pi$ (it is actually proven that the iterative process presented here will always converge to the same solution, in more or fewer cycles, with the worst-case upper bound of $N$ iterations, independently of this initialization, e.g., with $\Pi$ equal to a cheap estimation of the score of a "match"). In one of the developed versions, $\varepsilon_0$ is initialized to $-\log$ of the maximum of the local probabilities $P(q_k \mid x_n)$ for each frame $x_n$.

An alternative choice is to initialize $\varepsilon_0$ to a pre-defined threshold score, $T$, that expression (1) should reach to declare a keyword "matching" (see step 4 below). In this last case, if $\varepsilon_1 > \varepsilon_0$ at the first iteration, then we can (as proven) directly infer that the match will be rejected; otherwise it will be accepted.

2. Given the estimate $\varepsilon_t$ of $-\log P(q_G \mid x_n)$ at the current iteration $t$, find the optimal path $\langle \overline{Q}_t, b_t, e_t \rangle$ according to (6), matching the complete input.

3. Estimate the value of $\varepsilon_{t+1}$ to be used in the next iteration as the average of the local posteriors along the optimal path $Q_t$ (matching the $X_{b_t}^{e_t}$ resulting from (6) on the keyword model), i.e.:

$$\varepsilon_{t+1} = -\frac{1}{e_t - b_t + 1} \log P(Q_t \mid X_{b_t}^{e_t}) \qquad (7)$$

4. Increment $t$ and return to step 2, iterating until convergence is detected. If we are not interested in the optimal segmentation, this process could also be stopped as soon as it reaches an $\varepsilon_{t+1}$ lower than a (pre-defined) minimum threshold, $T$, below which we can declare that a keyword has been detected.

Correctness and convergence proofs of this process, and its generalization to other criteria, are available: each IVD iteration (from the second iteration on) will decrease the value of $\varepsilon_t$, and the final path yields the same solution as (3). The above method has a very good experimental convergence speed (3-5 iterations in our tests). For one version of IVD (when $\varepsilon_0$ is initialized using the acceptance threshold, $T$), the detection is decided after one single step.
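The iteration described above can be sketched end to end as follows. This is a minimal sketch under assumptions that go beyond the text: a strictly left-to-right keyword HMM, transition probabilities equal to 1, one garbage cost `eps` shared by all frames, at least as many frames as keyword states, and a hypothetical table `cost[n][l]` of $-\log P(q_l \mid x_n)$. The stopping test on `tol` is an illustrative choice.

```python
import math

def filler_viterbi(cost, eps):
    """One DP pass over the whole input with garbage cost eps (criterion (6)).
    Returns (total_cost, b, e), 0-based inclusive keyword boundaries."""
    n_frames, n_states = len(cost), len(cost[0])
    score, start = [math.inf] * n_states, [None] * n_states
    best = (math.inf, None, None)
    for n in range(n_frames):
        new_score, new_start = [math.inf] * n_states, [None] * n_states
        for s in range(n_states):
            cand, st = score[s], start[s]
            if s > 0 and score[s - 1] < cand:
                cand, st = score[s - 1], start[s - 1]
            if s == 0 and n * eps < cand:
                cand, st = n * eps, n
            new_score[s], new_start[s] = cand + cost[n][s], st
        score, start = new_score, new_start
        total = score[-1] + (n_frames - 1 - n) * eps
        if total < best[0]:
            best = (total, start[-1], n)
    return best

def iterating_viterbi(cost, eps0, max_iter=100, tol=1e-9):
    """Repeat steps 2-4: decode with eps, then re-estimate eps as in (7)."""
    n_frames = len(cost)
    eps = eps0
    for _ in range(max_iter):
        total, b, e = filler_viterbi(cost, eps)
        length = e - b + 1
        # remove the garbage contribution, then normalize by the keyword length
        new_eps = (total - (n_frames - length) * eps) / length
        if abs(new_eps - eps) <= tol:        # fixed point reached
            break
        eps = new_eps
    return new_eps, b, e
```

On a small table such as `[[2, 2], [0.1, 2], [2, 0.1], [2, 2]]` the loop settles on the same segment whether it is started above or below the final score, which is the behaviour the convergence statement describes.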

A version with the same effort but suboptimal results is proposed in the following paragraph. Let $T(M, X)$ be a matrix holding the HMM emission probabilities for an utterance $X$ whose time frames define the columns, and where the states of the hypothesized word $W$ define the rows. When using the standard DP, one computes for each element of the matrix $T(M, X)$ at frame $k$ of $X$ and state $s$ of $M$ three values: $S_{ks}$, $L_{ks}$ and $C_{ks}$, where $S_{ks}$ corresponds to the sum of the entries on the optimal path that leads to the entry, $L_{ks}$ holds the length of the optimal path computed so far, and $C_{ks}$ is the estimation of the cost on the optimal expanded path. By a path leading to an entry $T(k, s)$ we mean a sequence of entries in the table $T$, such that there is exactly one entry for each time frame $t < k$. At each entry $T(k, s)$, DP selects a locally optimal path noted $P_{ks}$. At each step $k$, we consider all pairs of entries of table $T(M, X)$ of type $T(k, s)$, $T(k-1, t)$. We update for each such pair the current cost $C_{ks}$ (initially $\infty$), by comparing it with the alternative given by:

$$
S_{ks} = S_{(k-1)t} - \log p(s \mid x_k)\, p(s \mid t), \qquad
L_{ks} = L_{(k-1)t} + 1, \quad \forall t > 0,\ t \le L, \qquad
C_{ks} = \frac{S_{ks}}{L_{ks}}
\qquad (8)
$$

wanting to have at step $k$ the path $P_{ks}$, from the paths $P_{(k-1)t}$, that minimizes $C_{NL}$. With DP, one will choose the $P_{ks}$ with minimal $C_{ks}$.

This version can yield suboptimal results since the optimality principle is not respected by expression (8). The optimality principle of Dynamic Programming requires that the path to the frame $k - 1$ that minimizes $C_{NL}$ also minimizes $C_{ks}$ for an entry at frame $k$ of table $T(M, X)$.
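The single-pass recursion of expression (8) can be sketched as below. As assumptions made only for this illustration, the keyword HMM is strictly left-to-right, transition probabilities are 1 (so only the emission term enters $S_{ks}$), a path may start in the first state at any frame, and `cost[k][s]` (a hypothetical table) holds $-\log p(s \mid x_k)$.

```python
import math

def greedy_normalized_dp(cost):
    """One-pass DP keeping, per entry, the single predecessor with the lowest
    running average C = S / L. Returns (best final average, end frame)."""
    n_frames, n_states = len(cost), len(cost[0])
    S = [[math.inf] * n_states for _ in range(n_frames)]   # accumulated sums
    Ln = [[0] * n_states for _ in range(n_frames)]         # path lengths
    for k in range(n_frames):
        for s in range(n_states):
            cands = []
            if s == 0:
                cands.append((cost[k][0], 1))              # a path starts here
            if k > 0:
                for t in (s, s - 1):                       # allowed predecessors
                    if t >= 0 and Ln[k - 1][t] > 0:
                        cands.append((S[k - 1][t] + cost[k][s], Ln[k - 1][t] + 1))
            if cands:
                # local decision on the normalized cost, as in (8)
                S[k][s], Ln[k][s] = min(cands, key=lambda c: c[0] / c[1])
    finals = [(S[k][-1] / Ln[k][-1], k) for k in range(n_frames) if Ln[k][-1] > 0]
    return min(finals, default=(math.inf, None))
```

Only one alternative survives at each entry, so a prefix with a slightly worse average but a greater length can be lost even though it would have given a better final score; this is the loss of optimality the paragraph refers to.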

Another technique that is suboptimal in time and/or quality is obtained from the previous one by adopting a beam-search approach and a set of safe prunings. Dynamic Programming can be viewed as a set of safe prunings that are applied at each entry of the DP table and has the property that only one alternative is maintained. Here, Dynamic Programming cannot be used, since the principle of optimality is not respected. The present invention introduces the following set of safe prunings: we have proved that if at a frame $a$ we have two paths $P'_a$ and $P''_a$ with $S''_a < S'_a$ and $L'_a < L''_a$, then at no frame $c > a$ will a path $P''_c$ be forsaken for a path $P'_c$ if $P'_a \subset P'_c$, $P''_a \subset P''_c$ and $P'_c \setminus P'_a = P''_c \setminus P''_a$. We will note the order relation as $P''_a \prec P'_a$. We have further shown that a path $P'$ may be safely discarded only when we know a lower-cost one, $P''$.

$$P' \prec P'' \Rightarrow C'_k < C''_k \qquad (9)$$

Thus, the following method computes $S(M \mid X)$ and $Q^*$ from equation (3). By ordering the set of paths according to Equation (9), we only need to check step (1.1) of the following method up to the eventual insertion place. The last paths are candidates for pruning in step (1.2). In order for the pruning to be acceptable, we will prune only paths that were too long on the last state. An additional counter for each path is needed for storing the state length. This counter is reset when an entry from another row is added and is incremented at each advance by a frame. The following steps detail this method for a model $W$ and an utterance $X$:

a) Initialize all elements of a matrix, SetOfPaths(1..N, 1..K), to $\emptyset$ (the empty set)

b) For all frames from 1 to N, for all states from 1 to K, for all candidates $p_i$ in SetOfPaths(frame - 1, 1..K):

- For all $p_j$ in SetOfPaths[frame, state], if $p_i \prec p_j$ then delete $p_j$ (1.1), and if $p_j \prec p_i$ then continue step b) (1.2)

- Insert $p_i$ in SetOfPaths[frame, state]

c) Select SetOfPaths[frame, K] as the best of the candidates
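Steps a) to c) can be sketched as follows, keeping at every (frame, state) entry only the (sum, length) pairs that are not made redundant by another pair. The sketch rests on assumptions not stated in the text: a strictly left-to-right HMM, transition probabilities equal to 1, non-negative entries `cost[k][s]` $= -\log p(s \mid x_k)$, and a non-strict variant of the order relation (equal pairs are merged). The ordering of the path sets and the state-length counter mentioned above are left out.

```python
import math

def dominates(p, q):
    """p = (S, L) makes q redundant: a sum that is no larger over no fewer frames."""
    return p[0] <= q[0] and p[1] >= q[1]

def insert_pruned(paths, cand):
    """Steps (1.1)/(1.2): drop cand if dominated, else drop what it dominates."""
    if any(dominates(p, cand) for p in paths):
        return paths
    return [p for p in paths if not dominates(cand, p)] + [cand]

def exact_normalized_search(cost):
    """Best length-normalized score of criterion (3) via sets of safe paths."""
    n_frames, n_states = len(cost), len(cost[0])
    prev = [[] for _ in range(n_states)]
    best = math.inf
    for k in range(n_frames):
        cur = [[] for _ in range(n_states)]
        for s in range(n_states):
            cands = [(cost[k][0], 1)] if s == 0 else []    # a path may start here
            for t in (s, s - 1):                           # allowed predecessors
                if t >= 0:
                    cands += [(S + cost[k][s], n + 1) for S, n in prev[t]]
            for c in cands:
                cur[s] = insert_pruned(cur[s], c)
        for S, n in cur[-1]:                               # paths may end at frame k
            best = min(best, S / n)
        prev = cur
    return best
```

Two surviving paths at the same entry always trade a smaller sum against a greater length, so neither can be ruled out before their common future is known.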

The next method builds on the previous technique and is a fast procedure for maximizing a more complex confidence measure that yields better results in practice. The corresponding confidence measure is defined as:

$$\frac{1}{NVP} \sum_{h_i \in VP} \frac{\sum_{pst \in h_i} -\log(pst)}{\mathrm{length}(h_i)} \qquad (10)$$

where NVP stands for the number of visited phonemes and VP stands for the set of visited phonemes. An average is computed over all posteriors $pst$ of the emission probabilities for the time frames matched to the visited phoneme $h_i$. The function $\mathrm{length}(h_i)$ gives the number of time frames matched against $h_i$. This method uses a breadth-first Beam Search algorithm. It exploits a set of reduction rules and certain normalizations. For the state $q_G$, in this method, the logarithm of the emission posterior is equal to zero. For each frame $e$ and for each state $s$, the set of paths/probabilities of having the frame $e$ in the state $s$ is computed as the first $\mathcal{N}$ maxima ($\mathcal{N}$ can be finite) of the confidence measure for all paths in HMM $M$ of length $e$ and ending in the state $s$. The paths that, according to the reduction rules, will lose the final race when compared with another already known path will be deleted as well. Let us note $a_1$, $p_1$, $l_1$, respectively $a_2$, $p_2$ and $l_2$, the confidence measure for the previously visited phonemes, the posterior in the current phoneme and the length in the current phoneme for the path $Q_1$, respectively the path $Q_2$. The rules that can be used for the reduction of the search space by discarding a path $Q_1$ for a path $Q_2$ are in this case any of the following:



1. $l_2 > l_1$, $A > 0$, $B < 0$ and $L_c^2 A + L_c B + C > 0$

2. $l_2 > l_1$, $A > 0$, $B > 0$ and $C > 0$

3. $l_2 > l_1$, $A < 0$, $C > 0$ and $L^2 A + L B + C > 0$

4. $l_2 > l_1$, $A = 0$, $B < 0$ and $L B + C > 0$

where $A = a_1 - a_2$, $B = (a_1 - a_2)(l_1 + l_2) + p_1 - p_2$, $C = (a_1 - a_2) l_1 l_2 + p_1 l_2 - p_2 l_1$, $L = L_{max} - \max\{l_1, l_2\}$, $L_c = -B/2A > 0$ and $L_{max}$ is the maximum acceptable length for a phoneme. By discarding paths only if one of the above rules is satisfied, the optimum defined by the confidence measure with double normalization can be guaranteed, if no phone may be avoided by the HMM $M$. Any HMM may be decomposed into HMMs with this quality. The 4th rule is included in the 3rd and its test is useless if the last one was already checked. The first test, $l_2 > l_1$, tells us if $Q_2$ has chances to eliminate $Q_1$; otherwise we will check if $Q_1$ eliminates $Q_2$. These tests were inferred from the conditions of maintaining the final maximal confidence measure while reduction takes place. In order to use the method of double normalization without decomposing HMMs that skip some phonemes, the previous rules are modified taking into account the number of visited phonemes for any path, $F_1$ respectively $F_2$, and the number of phonemes that may follow the current state. A simplified test can be:

• $l_2 > l_1$, $A > 0$, $p_1 > p_2$, respectively $F_2 > F_1$ for the HMMs that skip phonemes.

This test is weaker than the 2nd reduction rule. For example, a path is eliminated by a second path if the first one has an inferior confidence measure (higher in value) for the previous phonemes, a shorter length and the minus of the logarithm of the cumulated posterior in the current phoneme also inferior (higher in value) to that of the second one. An additional confidence measure based on the maximal length, $L_{max}$, and on the maximum of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$, can be used in order to limit the number of stored paths:

• $p > L_{max} P_{max}$ in any state

• $p / l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the cumulated posterior and for the length of the path that is discarded. These tests allow for the elimination of the paths that are too long without being outstanding, respectively of the paths with phonemes having unacceptable scores, otherwise compensated by very good scores in other phonemes. If $\mathcal{N}$ is chosen equal to one, the aforementioned rules are no longer needed, but we always propagate the path with the maximal current estimation of the confidence measure. The obtained results are very good, even if the defined optimum is guaranteed for this method only when $\mathcal{N}$ is bigger than the length of the sequence allowed by $L_{max}$ or of the tested sequence. The same approach is valid for the simple normalization, where the HMM for the searched word will be grouped into a single phoneme.
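The four reduction rules can be read as sufficient conditions for the quadratic $A y^2 + B y + C$ to stay positive for every number $y$ of additional frames (between 0 and $L$) that both paths may still spend in the current phoneme. Under that reading, which is an interpretation rather than a statement of the text, a check could be sketched as follows; the function name and argument order are hypothetical, and both paths are assumed to have visited the same number of phonemes.

```python
def can_discard_q1(a1, p1, l1, a2, p2, l2, l_max):
    """True if one of rules 1-4 allows discarding path Q1 in favor of Q2.
    a: accumulated measure of previous phonemes, p: accumulated -log posterior
    in the current phoneme, l: frames spent in the current phoneme."""
    if not l2 > l1:                       # first test shared by all four rules
        return False
    A = a1 - a2
    B = A * (l1 + l2) + p1 - p2
    C = A * l1 * l2 + p1 * l2 - p2 * l1
    L = l_max - max(l1, l2)               # frames still allowed in this phoneme
    if A > 0 and B < 0:                   # rule 1: check the vertex of the parabola
        Lc = -B / (2 * A)
        return Lc * Lc * A + Lc * B + C > 0
    if A > 0 and B > 0:                   # rule 2
        return C > 0
    if A < 0:                             # rule 3: check both ends of [0, L]
        return C > 0 and L * L * A + L * B + C > 0
    if A == 0 and B < 0:                  # rule 4
        return L * B + C > 0
    return False
```

For example, with `a1=1.0, p1=3.0, l1=2` against `a2=0.5, p2=2.0, l2=3` rule 2 applies, and Q1 indeed stays worse than Q2 for any common continuation inside the current phoneme.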

The present invention can exploit a newly designed confidence measure, named "Real Fitting", that represents differently the exigencies of the recognition. Since the phonemes and the absent states can be modeled by the used HMMs, we find it interesting to request the fitting of each phoneme in the model with a section of the sequence. Therefore, we measure the confidence level of a subsequence as being equal to the maximum over all phonemes of the minus of the logarithm of the cumulated posterior of the phoneme, normalized with its length:

$$\max_{phoneme \,\in\, VisitedPhonemes} \frac{\sum_{phoneme} -\log(posteriors)}{phoneme\ length} \qquad (11)$$

The rule that may be used in this framework for the reduction of the number of visited paths is:

• $Q_2$ is discarded in favor of another path $Q_1$ if the confidence measure of the Real Fitting for the previous phonemes is inferior (higher in value) for $Q_2$ compared with $Q_1$, and if $p_1 < p_2$ and $l_2 < l_1$,

where $p_1$, $l_1$, respectively $p_2$, $l_2$, represent the minus of the logarithm of the cumulated posterior, respectively the number of frames in the current phoneme, for the path $Q_1$ respectively $Q_2$. Similarly to the previous method, the set of visited paths can be pruned by discarding those where:

• $p > L_{max} P_{max}$ in any state

• $p / l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the cumulated posterior and for the length of the path that is discarded. We recall that the constants are the maximal length, $L_{max}$, and the accepted maximum of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$.
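For one fixed alignment, the Real Fitting measure (11) and its reduction rule can be sketched as below. The inputs are hypothetical: `frame_costs[n]` holds the $-\log$ posterior of the state matched to frame $n$, `phone_of_frame[n]` gives the index of the phoneme that frame belongs to, and each phoneme is assumed to occupy one contiguous run of frames.

```python
def real_fitting(frame_costs, phone_of_frame):
    """Worst per-phoneme average of -log posteriors along one alignment."""
    sums, counts = {}, {}
    for c, j in zip(frame_costs, phone_of_frame):
        sums[j] = sums.get(j, 0.0) + c
        counts[j] = counts.get(j, 0) + 1
    return max(sums[j] / counts[j] for j in sums)

def discard_q2(rf_prev1, p1, l1, rf_prev2, p2, l2):
    """Reduction rule: Q2 is worse on the previous phonemes and has a higher
    accumulated cost over fewer frames in the current phoneme."""
    return rf_prev2 > rf_prev1 and p1 < p2 and l2 < l1
```

Because the measure is a maximum rather than an average, a single badly fitting phoneme cannot be compensated by good scores elsewhere.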

This invention thus proposes a new method for keyword spotting, based on recent advances in confidence measures, using local posterior probabilities, but without requiring the explicit use of filler models. A new method, referred to as Iterating Viterbi Decoding (IVD), is proposed to solve the above optimization problem with a simple DP process (not requiring the storage of pointers and scores for all possible ending and start times). Three other new beam-search algorithms, corresponding to three different confidence measures, are also proposed.

304 To summarize, the object of the invention consists of: 

305 • Method of recognition of a subsequence using a direct maximization of confidence 

306 measures. 




307 • The method of IVD for directly maximizing the confidence measures based on simple 

308 normalization. 

309 • The use of the confidence measure and method of recognition named 'Real Fitting', 

310 based on individual fitting for each phoneme. 

• Methods of recognition using simple and double normalization, by combining these measures with the additional confidence measures mentioned here, respectively the maximal length and real matching limitation.

314 • The use of the aforementioned methods in keyword recognition. 

315 • The use of the aforementioned methods in subsequence recognition of organic matter. 

316 • The use of the aforementioned methods in recognition of objects in images. 

317 DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

318 Execution: The method can be performed using a personal computer or can be imple- 

319 mented in specialized hardware. 

320 1. A representation under the form of an HMM is obtained for the subsequences that are 

321 looked for (word, protein profile, section of an image of the object). 

2. A tool will be obtained (and, where applicable, trained, e.g. for speech recognition) for the estimation of the posteriors. Examples: multi-Gaussians, neural networks, clusters, database with Generalized Profiles and mutation matrices (PAM, BLOSUM, etc.).

3. One of the proposed algorithms should be implemented. They yield close performance, but the method of Real Fitting coupled with a well-checked dictionary should perform best.

For the first algorithm (IVD):

(a) The classic Viterbi algorithm is implemented with the modification that, for each pair $P$ = (sample, state), one propagates the time frames of transition between the state $q_G$ and the states of the HMM $M$ for the path that arrives at $P$. These are inherited from the path that wins the entrance in the pair $P$, except at the moment when their decision is taken, namely when they receive the index of the corresponding sample.

(b) $w = -\log P(M \mid X_{b_t}^{e_t})$ is computed by subtracting from the cumulated posterior that is returned by the Viterbi algorithm for the path $\overline{Q}_t$ the value $(N - (e_t - b_t + 1)) \cdot \varepsilon_t$, corresponding to the contribution of the states $q_G$, and dividing the result by $e_t - b_t + 1$. The term $e_t - b_t + 1$ from the previous formula can be factored outside the fraction.

(c) The initialization of $\varepsilon$ is made with an expected mean value. One can use the $w$ that is computed when the state $q_G$ is associated with an emission posterior equal to the average of the best $K$ emission probabilities of the current sample, as done in the well-known "garbage on-line model". In this case, $K$ is trained using the corresponding technique.

The next 'Beam search' algorithms are implemented according to the description in the corresponding sections. For each pair $P$ = (sample, state) one computes for each corresponding path the sum and length in the last phoneme, as well as the sum over the normalized cumulated posteriors of the previous phonemes (and their number). Also, the entrance and exit samples into the HMM $M$ are computed and propagated like in the previous method, in order to ensure the localization of the subsequence.

351 4. If one searched entity (keyword, sequence, object) can have several HMM models, all 

352 of them are taken into consideration as competitors. This is the case of the words 

353 with several pronunciations (or of the objects that have different structures in different 

354 states, for the recognition in images). 

After the computation of the confidence measure for each model of the subsequences, one eliminates those with a confidence measure in disagreement with a 'threshold' that is trained for the configuration and the goal of the given application. For example, for speech recognition with neural networks and minus of the logarithm of the posteriors, the 'threshold' is chosen at the wanted point of the ROC curve obtained in tests.

5. The remaining alternatives are extracted in the order of their confidence measure, with the elimination of the conflicting alternatives, until exhaustion. Each time an alternative is eliminated, the searched entity with the corresponding HMM is re-estimated for the remaining sections in the sequence in which the search is performed. If the new confidence measure passes the test of the 'threshold', then it will be inserted in the position corresponding to its score in the queue of alternatives.

6. The successful alternatives can undergo tests of superior levels, for example a question of confirmation for speech recognition, the opinion of an operator, etc.

368 7. For objects recognition in images: 

369 Posteriors are obtained by computing a distance between the color of the model and 

370 that of element in the section of the image. If the context requires, the image will be 

371 preprocessed to ensure a certain normalization (Ex: changeable conditions of light will 

372 make necessary a transformation based on the histogram). 

373 The phonemes of the speech recognition correspond to parts of the object. The struc- 

374 ture (existence of transitions and their probabilities) can be modified, function of the 

375 characteristics detected along the current path. For example, after detecting regions 

376 of the object with certain lengths, one can estimate the expected length of the remain- 

377 ing regions. Thus, the number of the expected samples for the future states can be 

378 established and the HMM attached to the object will be configured accordingly. 

379 A direction is scanned for the detection of the best fitting and afterwards, other direc- 

380 tions will be scanned for discovering new fittings, as well as for testing the previous 

381 ones. The final test will be certified by classical methods such as cross-correlation or 

382 by the analysis of the contours in the hypothesized position. 
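The thresholding of step 4 and the ordered extraction of step 5 can be sketched as follows. This is a simplified sketch: the re-estimation of an eliminated alternative on the remaining sections of the sequence is deliberately left out, two alternatives are assumed to conflict when their frame ranges overlap, and the tuple layout `(score, b, e, label)` is a hypothetical choice, with a lower score meaning a better match.

```python
def select_detections(candidates, threshold):
    """Keep candidates passing the threshold, best first, skipping overlaps."""
    kept = []
    passing = [c for c in candidates if c[0] <= threshold]   # step 4
    for cand in sorted(passing):                             # step 5: best score first
        _, b, e, _ = cand
        # accept only if the segment [b, e] overlaps no accepted segment
        if all(e < kb or b > ke for _, kb, ke, _ in kept):
            kept.append(cand)
    return kept
```

For instance, of two overlapping hypotheses with scores 0.2 and 0.3 only the first is kept, while a later non-overlapping hypothesis with score 0.4 is still accepted if the threshold is 0.5.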

To mention some examples for the application of the proposed method:

• The recognition of keywords is beginning to be used in automated answering systems for banking, as well as in telephone-based automata for control, sales or information. The method offers a possibility to recognize keywords in spontaneous speech with multiple speakers.

• The recognition of DNA sequences is important for the study of the human genome. One of the biggest problems of the involved techniques is the large quantity of data that has to be processed.

• The recognition of objects in images is used, among others, in cartography and in the coordination of industrial robots. The method allows a quick estimation of the position of the objects in scenes and can be validated with extra tests, using classical methods of cross-correlation.

WE CLAIM:



396 1. (canceled) rewritten/re-presented in claim 5 

397 2. (canceled) rewritten/re-presented in claim 6 

398 3. (canceled) rewritten/re-presented in claim 7 

399 4. (canceled) rewritten/re-presented in claim 8 

400 5. (canceled) rewritten/re-presented in claim 9 

401 6. (canceled) rewritten/re-presented in claim 10 

402 7. (canceled) rewritten/re-presented in claim 11 

403 8. (canceled) rewritten/re-presented in claim 12 

404 9. (re-presented - formerly independent claim 5) A method of recognizing an observed

405 subsequence as being generated by one of a set of Hidden Markov Models (HMMs),

406 characterized by:

407 • the fact that it searches for the subsequence, Q, that minimizes an

408 inverse confidence measure, over all possible matchings,

409 • where the inverse confidence measure is one of 

410 1) the accumulated posterior, normalized with the length of the matched sub-

411 sequence $X_b^e$ (a.k.a. 'simple normalization'),

412 $-\frac{1}{e-b+1}\log P(Q \mid X_b^e)$

413 2) partitioning the states in an HMM into phonemes, having a function

414 Phonemes(Q) that returns the segmentation of a path Q in the HMM into 

415 phonemes, and computing one of:

416 2a) the worst average match in a phoneme, called 'real fitting',

417 $\arg\min_{Q} \max_{Q_j \in \mathrm{Phonemes}(Q)} \frac{\sum_{q_k \in Q_j} -\log P(q_k \mid x_k)}{|\{k \mid q_k \in Q_j\}|}$

418 2b) double normalization of the accumulated posterior over the number of

419 phonemes, J, and over the number of acoustic samples, $e_j - b_j + 1$, where

420 $b_j$ is the time frame where Q enters phoneme j, and $e_j$ is the exit time

421 frame from phoneme j,

422 $\frac{1}{J}\sum_{j=1}^{J}\frac{1}{e_j-b_j+1}\sum_{k=b_j}^{e_j} -\log P(q_k \mid x_k)$

423 • and allows for the optional re-evaluation of the alternatives that offer the high-

424 est scores of one of the mentioned confidence measures on the basis of another confidence

425 measure,

426 • and when based on the confidence measure called 'simple normalization' uses a 

427 method that applies Viterbi decoding to an HMM obtained by extending the initial

428 one with a filler state just after the start state and one just before the termination state,

429 and estimates the emission probability of the filler states in an iterative manner 

430 as being equal to the inverse confidence measure in the previous iteration, 

431 and where the emission probability in the filler states in the first iteration can be 

432 initialized to any floating point number, but the iteration stops:

433 i) at convergence, yielding the estimation of a keyword's boundaries and score

434 as the obtained boundaries and score of the non-filler states of the HMM,

435 ii) when the confidence measure descends under a threshold value, T, estimating

436 only the keyword existence,

437 iii) when the emission probability of the filler states, $e_0$, is initialized with T and is

438 re-estimated, as the value $e_1$ at the end of the first iteration, to be higher than

439 T, deciding keyword inexistence,

440 • or for any of the three confidence measures: 'simple normalization', 'double nor-

441 malization' or 'real fitting', uses a beam-search-like algorithm that considers the

442 emission probability of the filler state as zero, computes progressively for each

443 pair of sample and HMM state a set of possible alternative paths to reach it,

444 where the computation of this set is based on the sets of paths that lead to the states that

445 can be associated with the previous sample, extended with the transitions allowed

446 by the analyzed HMM,

447 where this set can be reduced by using appropriate (safe) rules for the given 

448 confidence measure, ensuring the correctness of the inference, 

449 and where this set can be also reduced by using heuristics, for speeding up the 

450 computation despite the risk of reducing the theoretical quality of the recognition, 

451 heuristics of which a fast version stores only the best match, 

452 and for all confidence measures one can prune the set of alternatives with safe rules 

453 guaranteeing optimality, where:

• the 'simple normalization' confidence measure with beam-search is used with a
safe pruning that discards a path $Q_1$ given the existence of a path $Q_2$ whenever
$S_2 < S_1$ and $L_1 < L_2$, where $S_1$ and $L_1$, respectively $S_2$ and $L_2$, are the minus of
the cumulated log of posteriors along the paths, and the lengths of the paths, for
the paths $Q_1$ respectively $Q_2$, and which can be optimized by sorting competing
paths based on their cost,

• the 'double normalization' confidence measure on HMMs where no path skips any
phoneme is used with a safe pruning that discards a path $Q_1$ given the existence
of a path $Q_2$ whenever one of the following tests succeeds:

(a) $l_2 > l_1$, $A > 0$, $B < 0$ and $L_c^2 A + L_c B + C > 0$

(b) $l_2 > l_1$, $A > 0$, $B > 0$ and $C > 0$

(c) $l_2 > l_1$, $A < 0$, $C > 0$ and $L^2 A + L B + C > 0$

(d) $l_2 > l_1$, $A = 0$, $B < 0$ and $L B + C > 0$

where we denote by $a_1$, $p_1$, $l_1$, respectively by $a_2$, $p_2$ and $l_2$, the confidence measure
for the previously visited phonemes, the posterior in the current phoneme and
the length in the current phoneme for the path $Q_1$, respectively the path $Q_2$,
and we also use the notations $A = a_1 - a_2$, $B = (a_1 - a_2)(l_1 + l_2) + p_1 - p_2$,
$C = (a_1 - a_2) l_1 l_2 + p_1 l_2 - p_2 l_1$, $L = L_{max} - \max\{l_1, l_2\}$, $L_c = -B/2A$ and $L_{max}$ is
the maximum acceptable length for a phoneme,

• the 'double normalization' confidence measure on HMMs where some paths skip
phonemes is used with a safe pruning that discards a path $Q_1$ given the existence
of a path $Q_2$ whenever $l_2 > l_1$, $A > 0$, $p_1 > p_2$ respectively $F_2 > F_1$,
where $F_1$ respectively $F_2$ are the number of visited phonemes for paths $Q_1$ and $Q_2$,

• the 'real fitting' is used with the safe pruning: $Q_2$ is discarded in favor of another
path $Q_1$ if the confidence measure of the Real Fitting for the previous phonemes
is inferior (higher in value) for $Q_2$ compared with $Q_1$, and if $p_1 < p_2$ and $l_2 < l_1$,
where $p_1$, $l_1$, respectively $p_2$, $l_2$, represent the minus of the logarithm of the cumu-
lated posterior respectively the number of frames in the current phoneme for the
path $Q_1$ respectively $Q_2$,

• and besides the previously mentioned safe pruning, heuristic prunings are also
used for removing paths when $p > L_{max} P_{max}$ in any state or when $p/l > P_{max}$ at
the output from a phoneme,

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm
of cumulated posterior and for the length of the path that is discarded.
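
The iterative filler re-estimation recited above for the 'simple normalization' measure can be illustrated by the following sketch. It is offered only as an example under stated assumptions, not as the claimed implementation: the keyword model is assumed to be a strict left-to-right chain of states with self-loops, transition probabilities are ignored, `costs[n][k]` holds the local cost (minus log posterior) of keyword state k at frame n, the filler value is handled as a per-frame cost `eps`, and the function names are hypothetical.

```python
INF = float("inf")


def viterbi_with_fillers(costs, eps):
    """One Viterbi pass over: start filler -> keyword states -> end filler.

    costs[n][k] is the local cost of keyword state k at frame n and eps is
    the per-frame cost of both filler states.
    Returns (begin, end, average keyword cost per frame).
    """
    n_frames, n_kw = len(costs), len(costs[0])
    # A hypothesis is (total_cost, begin, end, keyword_cost).
    # Index 0 is the start filler, 1..n_kw the keyword, n_kw + 1 the end filler.
    dead = (INF, -1, -1, INF)
    prev = [dead] * (n_kw + 2)
    for n in range(n_frames):
        cur = [dead] * (n_kw + 2)
        cur[0] = (eps if n == 0 else prev[0][0] + eps, -1, -1, 0.0)
        for k in range(1, n_kw + 1):
            c = costs[n][k - 1]
            total, begin, _, kw = prev[k]          # stay in the same state
            cands = [(total + c, begin, -1, kw + c)]
            if k == 1:                             # enter the keyword at frame n
                entry = 0.0 if n == 0 else prev[0][0]
                cands.append((entry + c, n, -1, c))
            else:                                  # advance from the previous state
                total, begin, _, kw = prev[k - 1]
                cands.append((total + c, begin, -1, kw + c))
            cur[k] = min(cands)
        total, begin, _, kw = prev[n_kw]           # keyword ended at frame n - 1
        stay = (prev[n_kw + 1][0] + eps,) + prev[n_kw + 1][1:]
        cur[n_kw + 1] = min((total + eps, begin, n - 1, kw), stay)
        prev = cur
    total, begin, _, kw = prev[n_kw]               # keyword may end on the last frame
    _, begin, end, kw = min((total, begin, n_frames - 1, kw), prev[n_kw + 1])
    return begin, end, kw / (end - begin + 1)


def spot_keyword(costs, eps=0.0, max_iter=100, tol=1e-12):
    """Set the filler cost to the previous average keyword cost until it is stable."""
    for _ in range(max_iter):
        begin, end, score = viterbi_with_fillers(costs, eps)
        if abs(score - eps) <= tol:
            break
        eps = score
    return begin, end, score
```

With a one-state keyword and frame costs 5, 1, 2, 6, the sketch settles on the single frame of cost 1, whether the filler cost starts at 0 or at 10.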

10. (re-presented - formerly dependent claim 6) The method of claim 9, where the method
is used to estimate the existence of keywords and their position in utterances, using 
Hidden Markov Models that model keywords. 

11. (re-presented - formerly dependent claim 7) The method of claim 9, where the method
is used to estimate the existence of biomolecular subsequences and their position in the 
chains of DNA using hidden Markov models to model the searched subsequences, and 
where these models can be obtained by trivial translation from generalized profiles. 

12. (re-presented - formerly dependent claim 8) The method of claim 9, where it carries out
497 the estimation of the existence of objects and their position in images, characterized

498 by the fact that 

499 • it uses models of objects as subsequences represented by Hidden Markov Models, 

500 • namely sections through views of objects are modeled by Hidden Markov Models, 

501 • it uses emission probabilities based on a distance computed between colors, sim-

502 ple distances being yielded by a Gaussian centered at the target color, or a

503 normalized inverse of the Euclidean distance in the RGB space,

504 • wherein the Hidden Markov Models that model the objects can be composed of

505 distinct regions, which play within the method the role of the phonemes

506 in claim 9,

507 • and wherein the models of the objects can be modified in a dynamic manner during

508 decoding with respect to the transition properties (existence and probability) on

509 the basis of the information accumulated so far in the process.

510 8 ABSTRACT OF THE DISCLOSURE

511 The invention belongs to the technical domain of decoding, classification, alignment and 

512 matching of data. 

513 The invention introduces a new method performing tasks in keyword spotting in ut- 

514 terances, detection of subsequences in chains of organic matter (DNA and proteins) and 

515 recognition of objects in images. The proposed method searches, in an optimized way, for the

516 matching that maximizes, over all the possible matchings, certain confidence measures based

517 on normalized posteriors. Three such confidence measures are used: two existed in previous

518 work in Speech Recognition, and the third one is new.
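
The three confidence measures can be illustrated on a path that has already been segmented into phonemes. In the sketch below, each inner list holds the local costs (minus log posteriors) of the frames assigned to one phoneme, so lower values mean a better match. The function names and the list-of-lists representation are assumptions of the example.

```python
def simple_normalization(phoneme_costs):
    """Accumulated cost divided by the total number of frames."""
    frames = [c for phoneme in phoneme_costs for c in phoneme]
    return sum(frames) / len(frames)


def real_fitting(phoneme_costs):
    """Worst (largest) per-phoneme average cost."""
    return max(sum(p) / len(p) for p in phoneme_costs)


def double_normalization(phoneme_costs):
    """Average over phonemes of the per-phoneme average cost."""
    averages = [sum(p) / len(p) for p in phoneme_costs]
    return sum(averages) / len(averages)
```

For a path whose three phonemes have frame costs [1, 3], [2] and [6, 2, 4], the per-phoneme averages are 2, 2 and 4, so 'simple normalization' gives 3, 'real fitting' gives 4 and 'double normalization' gives 8/3.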

519 Application fields for this invention are: man-machine interfaces (using speech recogni-

520 tion; e.g., control systems, banking, flight services, etc.), coordination systems (for industrial

521 robots and automata) and development systems for pharmaceutical products.